This tutorial requires the following R packages; install and load them if you have not yet:
if (!require("pacman")) {install.packages("pacman"); library(pacman)}
## Loading required package: pacman
pacman::p_load("moments", "ggplot2", "knitr", "printr", "plotly", "scales", "readr")
Question: What is the type for each of the variables in the TOD dataset?
californiatod <- read_csv("californiatod.csv")
## Parsed with column specification:
## cols(
## name = col_character(),
## region = col_character(),
## transit = col_double(),
## density = col_double(),
## houseval = col_double(),
## railtype = col_character()
## )
kable(californiatod)
| name | region | transit | density | houseval | railtype |
|---|---|---|---|---|---|
| Archstone Barrington Hills Apartments | Bay Area | 0.4000 | 3.36 | 192593.26 | Heavy or commuter rail |
| Archstone Mission Va | SD | 0.1077 | 2.18 | 203054.65 | Light rail |
| Atherton Place | Bay Area | 0.5000 | 5.21 | 217671.52 | Heavy or commuter rail |
| Bellamar Apartments | LA | 0.3178 | 10.05 | 117255.94 | Light rail |
| Coggins Square | Bay Area | 0.5278 | 4.68 | 270554.72 | Heavy or commuter rail |
| Del Norte Place | Bay Area | 0.6429 | 4.91 | 229335.59 | Heavy or commuter rail |
| Iron Horse Lofts | Bay Area | 0.6061 | 4.68 | 270207.40 | Heavy or commuter rail |
| Mercado Apartments | SD | 0.0000 | 3.21 | 121785.96 | Light rail |
| Mission Wells | Bay Area | 0.1475 | 3.96 | 323496.98 | Heavy or commuter rail |
| Northpark Apartments | Bay Area | 0.1738 | 3.27 | 667661.04 | Heavy or commuter rail |
| Ohlone-Chynoweith | Bay Area | 0.0000 | 4.00 | 357694.13 | Light rail |
| Pacific Court | LA | 0.0000 | 10.33 | 118401.77 | Light rail |
| Palo Alto Condos | Bay Area | 0.0667 | 4.78 | 779791.98 | Heavy or commuter rail |
| Park Regency | Bay Area | 0.3178 | 4.71 | 271728.59 | Heavy or commuter rail |
| Poinsettia Station | SD | 0.1538 | 1.72 | 273004.39 | Heavy or commuter rail |
| Studio Village | LA | 0.1875 | 8.70 | 246330.40 | Heavy or commuter rail |
| The Crossings | Bay Area | 0.1316 | 6.49 | 558643.30 | Heavy or commuter rail |
| Union Square Condos | SD | 0.1892 | 3.87 | 192739.76 | Light rail |
| Verandas Apartments | Bay Area | 0.4912 | 3.15 | 327738.97 | Heavy or commuter rail |
| Villages of LaMesa | SD | 0.1111 | 3.16 | 207159.11 | Light rail |
| Wayside Plaza | Bay Area | 0.4737 | 4.67 | 270087.94 | Heavy or commuter rail |
| Western Carlton | LA | 0.2100 | 14.85 | 371469.76 | Heavy or commuter rail |
| Wilcox Apartments | LA | 0.3333 | 14.76 | 288948.71 | Heavy or commuter rail |
| Wilshire Promenade | LA | 0.1053 | 3.84 | 216780.07 | Heavy or commuter rail |
| Windsor Ridge | Sacramento | 0.2222 | 2.36 | 152840.33 | Light rail |
| Woodlake Close | Sacramento | 0.1579 | 2.46 | 89045.74 | Light rail |
Descriptive Stats for a single variable
Common quantitative descriptive measures include mean, standard deviation (variance), median, IQR Skewness and Kurtosis are measures comparing the distribution of the variable of interest to a normal distribution.
mean(californiatod$density)
## [1] 5.36
sd(californiatod$density)
## [1] 3.531352
#range
max(californiatod$density) - min(californiatod$density)
## [1] 13.13
# robust measures
median(californiatod$density)
## [1] 4.335
IQR(californiatod$density)
## [1] 1.91
# Or use package like skimr
library(skimr)
##
## Attaching package: 'skimr'
## The following object is masked from 'package:knitr':
##
## kable
skim(californiatod)
| variable | type | stat | level | value | formatted |
|---|---|---|---|---|---|
| name | character | missing | .all | 0.000000e+00 | 0 |
| name | character | complete | .all | 2.600000e+01 | 26 |
| name | character | n | .all | 2.600000e+01 | 26 |
| name | character | min | .all | 1.200000e+01 | 12 |
| name | character | max | .all | 3.700000e+01 | 37 |
| name | character | empty | .all | 0.000000e+00 | 0 |
| name | character | n_unique | .all | 2.600000e+01 | 26 |
| region | character | missing | .all | 0.000000e+00 | 0 |
| region | character | complete | .all | 2.600000e+01 | 26 |
| region | character | n | .all | 2.600000e+01 | 26 |
| region | character | min | .all | 2.000000e+00 | 2 |
| region | character | max | .all | 1.000000e+01 | 10 |
| region | character | empty | .all | 0.000000e+00 | 0 |
| region | character | n_unique | .all | 4.000000e+00 | 4 |
| transit | numeric | missing | .all | 0.000000e+00 | 0 |
| transit | numeric | complete | .all | 2.600000e+01 | 26 |
| transit | numeric | n | .all | 2.600000e+01 | 26 |
| transit | numeric | mean | .all | 2.528808e-01 | 0.25 |
| transit | numeric | sd | .all | 1.904846e-01 | 0.19 |
| transit | numeric | p0 | .all | 0.000000e+00 | 0 |
| transit | numeric | p25 | .all | 1.162250e-01 | 0.12 |
| transit | numeric | p50 | .all | 1.883500e-01 | 0.19 |
| transit | numeric | p75 | .all | 3.833250e-01 | 0.38 |
| transit | numeric | p100 | .all | 6.429000e-01 | 0.64 |
| transit | numeric | hist | .all | NA | ▅▇▆▂▂▁▃▂ |
| density | numeric | missing | .all | 0.000000e+00 | 0 |
| density | numeric | complete | .all | 2.600000e+01 | 26 |
| density | numeric | n | .all | 2.600000e+01 | 26 |
| density | numeric | mean | .all | 5.360000e+00 | 5.36 |
| density | numeric | sd | .all | 3.531352e+00 | 3.53 |
| density | numeric | p0 | .all | 1.720000e+00 | 1.72 |
| density | numeric | p25 | .all | 3.225000e+00 | 3.23 |
| density | numeric | p50 | .all | 4.335000e+00 | 4.34 |
| density | numeric | p75 | .all | 5.135000e+00 | 5.13 |
| density | numeric | p100 | .all | 1.485000e+01 | 14.85 |
| density | numeric | hist | .all | NA | ▇▇▂▁▁▂▁▂ |
| houseval | numeric | missing | .all | 0.000000e+00 | 0 |
| houseval | numeric | complete | .all | 2.600000e+01 | 26 |
| houseval | numeric | n | .all | 2.600000e+01 | 26 |
| houseval | numeric | mean | .all | 2.821547e+05 | 282154.69 |
| houseval | numeric | sd | .all | 1.630922e+05 | 163092.19 |
| houseval | numeric | p0 | .all | 8.904574e+04 | 89045.74 |
| houseval | numeric | p25 | .all | 1.953185e+05 | 2e+05 |
| houseval | numeric | p50 | .all | 2.582092e+05 | 258209.17 |
| houseval | numeric | p75 | .all | 3.148599e+05 | 314859.91 |
| houseval | numeric | p100 | .all | 7.797920e+05 | 779791.98 |
| houseval | numeric | hist | .all | NA | ▅▇▇▂▁▁▁▁ |
| railtype | character | missing | .all | 0.000000e+00 | 0 |
| railtype | character | complete | .all | 2.600000e+01 | 26 |
| railtype | character | n | .all | 2.600000e+01 | 26 |
| railtype | character | min | .all | 1.000000e+01 | 10 |
| railtype | character | max | .all | 2.200000e+01 | 22 |
| railtype | character | empty | .all | 0.000000e+00 | 0 |
| railtype | character | n_unique | .all | 2.000000e+00 | 2 |
# skewness & kurtosis (from the moments package)
skewness(californiatod$density)
## [1] 1.626587
kurtosis(californiatod$density)
## [1] 4.732171
Histogram is a graphical representation of the distribution of Numeric variables.
require(ggplot2)
qplot(density, data=californiatod, geom="histogram")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# The default bin width is set to range/30 in qplot and it doesn't look good. Let's set it to 2
# You can experiment it with different values
qplot(density, data=californiatod, geom="histogram", binwidth=2)
# show percentage
hist.percent <- ggplot(californiatod, aes(x=density)) +
geom_histogram(aes(y=(..count..)/sum(..count..)), binwidth = 2) +
scale_y_continuous(labels=percent) + labs(y="")
hist.percent
# with normal curve superimposed
hist.percent + stat_function(fun=dnorm, args=list(mean=mean(californiatod$density), sd=sd(californiatod$density)))
qplot('all', y=density, data=californiatod, geom="boxplot")
table(californiatod$region) # counts
| Bay Area | LA | Sacramento | SD |
|---|---|---|---|
| 13 | 6 | 2 | 5 |
options(digits=2) # only show 2 decimal places
prop.table(table(californiatod$region)) #share
| Bay Area | LA | Sacramento | SD |
|---|---|---|---|
| 0.5 | 0.23 | 0.08 | 0.19 |
qplot(region, data=californiatod, geom="bar") # counts
# in percentage scale
ggplot(californiatod, aes(x=region)) +
geom_bar(aes(y=(..count..)/sum(..count..))) +
scale_y_continuous(labels=percent) + labs(y="")
Quantitative measures for the relationship between a pair of variables will be covered later. We will focus on visualization here.
qplot(x=density, y=houseval, data=californiatod, geom="point")
qplot(x=transit, y=houseval, data=californiatod, geom="point")
qplot(density, houseval, data=californiatod, geom="point") + geom_smooth(method=lm, se=FALSE)
qplot(transit, houseval, data=californiatod, geom="point") + geom_smooth(method=lm, se=FALSE)
plot.int <- ggplot(californiatod, aes(transit, houseval, text=paste("name:", name))) + geom_point()
ggplotly(plot.int)
qplot(railtype, houseval, data=californiatod, geom="boxplot")
qplot(region, houseval, data=californiatod, geom="boxplot")
# Bar plot, but with less information (only mean)
ggplot(californiatod, aes(railtype, houseval)) + geom_bar(stat = "identity")
ggplot(californiatod, aes(region, houseval)) + geom_bar(stat="identity")
# We can try scatter plot even though not ideal
qplot(railtype, houseval, data=californiatod, geom="point")
qplot(region, houseval, data=californiatod, geom="point")
qplot(transit, railtype, data=californiatod, geom="point")
qplot(density, railtype, data=californiatod, geom="point")
qplot(region, fill=railtype, data=californiatod, geom="bar")
# show percentage
ggplot(californiatod, aes(x=region, fill=railtype)) +
geom_bar(aes(y=(..count..)/sum(..count..))) +
scale_y_continuous(labels=percent) + labs(y="")
# compare with scatter plots, not ideal for working with categorical variables, even when the data points are jittered
qplot(region, railtype, data=californiatod, geom="point")
qplot(region, railtype, data=californiatod, geom="point") + geom_jitter()
table(californiatod$railtype, californiatod$region) # counts
| / | Bay Area | LA | Sacramento | SD |
|---|---|---|---|---|
| Heavy or commuter rail | 12 | 4 | 0 | 1 |
| Light rail | 1 | 2 | 2 | 4 |
prop.table(table(californiatod$railtype, californiatod$region), 2) * 100 #percentage
| / | Bay Area | LA | Sacramento | SD |
|---|---|---|---|---|
| Heavy or commuter rail | 92.3 | 67 | 0 | 20 |
| Light rail | 7.7 | 33 | 100 | 80 |
Now we will do something a bit more complicated. We will create a new two-way crosstab like you did above but now let’s include the mean of a continuous third variable. This sounds complex, but we shall approach it in a straightforward way using the dplyr package.
#the dplyr way
#we will first calculate our cross tabs and then the mean of transit usage
californiatod %>%
group_by(region, railtype) %>%
summarise(n = n(),
AvgTransit = mean(transit))
## # A tibble: 7 x 4
## # Groups: region [?]
## region railtype n AvgTransit
## <chr> <chr> <int> <dbl>
## 1 Bay Area Heavy or commuter rail 12 0.373
## 2 Bay Area Light rail 1 0.
## 3 LA Heavy or commuter rail 4 0.209
## 4 LA Light rail 2 0.159
## 5 Sacramento Light rail 2 0.190
## 6 SD Heavy or commuter rail 1 0.154
## 7 SD Light rail 4 0.102
californiatod %>%
group_by(region, railtype) %>%
skim()
## Skim summary statistics
## n obs: 26
## n variables: 6
## group variables: region, railtype
##
## Variable type: character
## region railtype variable missing complete n min max
## Bay Area Heavy or commuter rail name 0 12 12 12 37
## Bay Area Light rail name 0 1 1 17 17
## LA Heavy or commuter rail name 0 4 4 14 18
## LA Light rail name 0 2 2 13 19
## Sacramento Light rail name 0 2 2 13 14
## SD Heavy or commuter rail name 0 1 1 18 18
## SD Light rail name 0 4 4 18 20
## empty n_unique
## 0 12
## 0 1
## 0 4
## 0 2
## 0 2
## 0 1
## 0 4
##
## Variable type: numeric
## region railtype variable missing complete n mean
## Bay Area Heavy or commuter rail density 0 12 12 4.49
## Bay Area Heavy or commuter rail houseval 0 12 12 364959.27
## Bay Area Heavy or commuter rail transit 0 12 12 0.37
## Bay Area Light rail density 0 1 1 4
## Bay Area Light rail houseval 0 1 1 357694.13
## Bay Area Light rail transit 0 1 1 0
## LA Heavy or commuter rail density 0 4 4 10.54
## LA Heavy or commuter rail houseval 0 4 4 280882.23
## LA Heavy or commuter rail transit 0 4 4 0.21
## LA Light rail density 0 2 2 10.19
## LA Light rail houseval 0 2 2 117828.86
## LA Light rail transit 0 2 2 0.16
## Sacramento Light rail density 0 2 2 2.41
## Sacramento Light rail houseval 0 2 2 120943.04
## Sacramento Light rail transit 0 2 2 0.19
## SD Heavy or commuter rail density 0 1 1 1.72
## SD Heavy or commuter rail houseval 0 1 1 273004.39
## SD Heavy or commuter rail transit 0 1 1 0.15
## SD Light rail density 0 4 4 3.11
## SD Light rail houseval 0 4 4 181184.87
## SD Light rail transit 0 4 4 0.1
## sd p0 p25 p50 p75 p100 hist
## 0.94 3.15 3.81 4.68 4.81 6.49 ▅▂▁▇▃▁▁▂
## 192993.8 192593.26 259899.85 271141.66 385465.05 779791.98 ▃▇▁▁▁▁▁▁
## 0.2 0.067 0.17 0.44 0.51 0.64 ▇▇▁▃▃▇▇▇
## NA 4 4 4 4 4 ▁▁▁▇▁▁▁▁
## NA 357694.13 357694.13 357694.13 357694.13 357694.13 ▁▁▁▇▁▁▁▁
## NA 0 0 0 0 0 ▁▁▁▇▁▁▁▁
## 5.31 3.84 7.48 11.73 14.78 14.85 ▃▁▁▃▁▁▁▇
## 67265.85 216780.07 238942.82 267639.55 309578.97 371469.76 ▇▇▁▇▁▁▁▇
## 0.094 0.11 0.17 0.2 0.24 0.33 ▇▁▇▇▁▁▁▇
## 0.2 10.05 10.12 10.19 10.26 10.33 ▇▁▁▁▁▁▁▇
## 810.22 117255.94 117542.4 117828.86 118115.31 118401.77 ▇▁▁▁▁▁▁▇
## 0.22 0 0.079 0.16 0.24 0.32 ▇▁▁▁▁▁▁▇
## 0.071 2.36 2.38 2.41 2.44 2.46 ▇▁▁▁▁▁▁▇
## 45109.59 89045.74 1e+05 120943.04 136891.68 152840.33 ▇▁▁▁▁▁▁▇
## 0.045 0.16 0.17 0.19 0.21 0.22 ▇▁▁▁▁▁▁▇
## NA 1.72 1.72 1.72 1.72 1.72 ▁▁▁▇▁▁▁▁
## NA 273004.39 273004.39 273004.39 273004.39 273004.39 ▁▁▁▇▁▁▁▁
## NA 0.15 0.15 0.15 0.15 0.15 ▁▁▁▇▁▁▁▁
## 0.7 2.18 2.92 3.19 3.37 3.87 ▃▁▁▁▇▁▁▃
## 40061.18 121785.96 175001.31 2e+05 2e+05 207159.11 ▃▁▁▁▁▁▃▇
## 0.078 0 0.081 0.11 0.13 0.19 ▃▁▁▁▇▁▁▃
The dplyr package is designed to perform table operations on dataframes for multiple kinds of data manipulation. The preceding syntax told R to take our californiatod dataframe, group by the variables for region and railtype, and the calculate the mean transit usage from there. This is similar to the kind of pivot table operations once can do in excel.